1 Organization and performance of a two-level virtual-real cache hierarchy 100% 
W. H. Wang , J.-L. Baer , H. M. Levy 

ACM SIGARCH Computer Architecture News , Proceedings of the 16th annual 
international symposium on Computer architecture April 1989 
Volume 17 Issue 3 

We propose and analyze a two-level cache organization that provides high memory bandwidth. 
The first-level cache is accessed directly by virtual addresses. It is small, fast, and, without the 
burden of address translation, can easily be optimized to match the processor speed. The 
virtually-addressed cache is backed up by a large physically-addressed cache; this second-level 
cache provides a high hit ratio and greatly reduces memory traffic. We show how the 
second-level cache can be easily e ... 

2 Articles: Division of Labor in Embedded Systems 100% 
Ivan Goddard 

Queue April 2003 
Volume 1 Issue 2 



3 Pipeline behavior prediction for superscalar processors by abstract interpretation 100% 
Jörn Schneider , Christian Ferdinand 

ACM SIGPLAN Notices , Proceedings of the ACM SIGPLAN 1999 workshop on 
Languages, compilers, and tools for embedded systems May 1999 
Volume 34 Issue 7 

For real time systems not only the logical function is important but also the timing behavior, e.g. 
hard real time systems must react inside their deadlines. To guarantee this it is necessary to 
know upper bounds for the worst case execution times (WCETs). The accuracy of the 
prediction of WCETs depends strongly on the ability to model the features of the target 
processor. Cache memories, pipelines and parallel functional units are architectural 
components which are responsible for the speed gai ... 



4 An effective write policy for software coherence schemes 
Y.-C. Chen , A. V. Veidenbaum 

Proceedings of the 1992 ACM/IEEE conference on Supercomputing December 1992 



5 An improved replacement strategy for function caching 99% 
William Pugh 

Proceedings of the 1988 ACM conference on LISP and functional programming January 
1988 

Function caching is the technique of remembering previous function calls and avoiding the 
cost of recomputing them. Function caching provides a simple way of implementing dynamic 
programming algorithms and can provide a facility for incremental computation. Previous 
discussions of function caching have generally relied on the user to purge items from the 
function cache or have proposed a strategy such as least-recently-used without any analysis of 
the appropriateness of tha ... 
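
As a rough illustration of the function caching idea summarized in this abstract, the sketch below memoizes a recursive function so that repeated calls are answered from a cache. It uses Python's standard `functools.lru_cache`, that is, the least-recently-used policy the abstract mentions as a previously proposed strategy. It is not the improved replacement strategy of the paper, which the truncated abstract does not describe. The function `binomial` and the cache size of 128 are illustrative choices only.

```python
from functools import lru_cache


@lru_cache(maxsize=128)  # bounded cache; least-recently-used entries are purged
def binomial(n, k):
    """Number of k-element subsets of an n-element set (Pascal's rule)."""
    if k == 0 or k == n:
        return 1
    # Each subproblem is computed once while it stays cached, which turns the
    # exponential recursion into a dynamic programming algorithm.
    return binomial(n - 1, k - 1) + binomial(n - 1, k)


if __name__ == "__main__":
    print(binomial(10, 3))
    print(binomial.cache_info())  # hit and miss counters of the function cache
```

Because the cache is bounded, some entries may be evicted and recomputed; the result stays correct and only the amount of recomputation changes, which is why the choice of replacement strategy matters.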



6 Efficient and realistic simulation of disk cache performance 
John F. Cigas 

Proceedings of the 21st annual symposium on Simulation January 1988 
This paper describes an improved method for evaluating disk cache performance using trace 
driven simulation. This method differentiates between reads and writes in the trace data which 
results in higher miss ratios than when all traced events are treated alike. It allows for 
simulating various update policies and update intervals, physical blocks that are part of 
different files at different times, and the optimum replacement policy GOPT. These methods 
are applied to traces from a VAX 11/78 ... 



7 Session 84. 1 : power in memory and network processors: Embedded cache architecture with 
programmable write buffer support for power and performance flexibility 
Afzal Malik , Bill Moyer , Roger Zhou 

Proceedings of the international conference on Compilers, architecture, and synthesis for 
embedded systems October 2002 

Next generation portable devices are placing stringent requirements on overall system power 
and performance. Voice recognition, streaming video and high speed wireless internet access 
are just some of the features being incorporated in these handheld electronic gadgets. The 
M·CORE M341-S processor has been designed for high performance and cost sensitive 
portable products as well as for high end embedded control applications. M341-S obtains 
increased performance over the M·CORE M2 and M310 fami ... 



8 The Wisconsin multicube: a new large-scale cache-coherent multiprocessor 
J. R. Goodman , P. J. Woest 

ACM SIGARCH Computer Architecture News , Proceedings of the 15th Annual 
International Symposium on Computer architecture May 1988 
Volume 16 Issue 2 

The Wisconsin Multicube is a large-scale, shared-memory multiprocessor architecture that 
employs a snooping cache protocol over a grid of buses. Each processor has a conventional 
(SRAM) cache optimized to minimize memory latency and a large (DRAM) snooping cache 
optimized to reduce bus traffic and to maintain consistency. The large snooping cache should 
guarantee that nearly all the traffic on the buses will be generated by I/O and accesses to 
shared data. The p ... 

9 I/O reference behavior of production database workloads and the TPC benchmarks—an 99% 
analysis at the logical level 

Windsor W. Hsu , Alan Jay Smith , Honesty C. Young 

ACM Transactions on Database Systems (TODS) March 2001 

Volume 26 Issue 1 

As improvements in processor performance continue to far outpace improvements in storage 
performance, I/O is increasingly the bottleneck in computer systems, especially in large 
database systems that manage huge amounts of data. The key to achieving good I/O 
performance is to thoroughly understand its characteristics. In this article we present a 
comprehensive analysis of the logical I/O reference behavior of the peak production database 
workloads from ten of the world's largest corporatio ... 

10 Journaling with ReiserFS 99% 
Chris Mason 

Linux Journal February 2001 

Mason gives a tour through the Reiser File System: its features and construction. 

11 An architectural perspective on a memory access controller 99% 
M. Freeman 

Proceedings of the 14th annual international symposium on Computer architecture June 
1987 

In this paper a CMOS memory access controller chip is described that provides the basis for 
achieving high-performance 68020-based (68030-based) systems. This controller matches the 
speed of the memory system to that of the microprocessor by providing a virtual cache 
mechanism where address translations are only required when there is a cache miss. This 
mechanism also facilitates the construction of shared-memory multiprocessor systems where 
the controller manages ... 

12 Prefetching in segmented disk cache for multi-disk systems 99% 
Valery Soloviev 

Proceedings of the fourth workshop on I/O in parallel and distributed systems: part of 
the federated computing research conference May 1996 

13 Avoiding conflict misses dynamically in large direct-mapped caches 99% 
Brian N. Bershad , Dennis Lee , Theodore H. Romer , J. Bradley Chen 

Proceedings of the sixth international conference on Architectural support for 
programming languages and operating systems November 1994 
Volume 29 , 28 Issue 11,5 

This paper describes a method for improving the performance of a large direct-mapped cache 
by reducing the number of conflict misses. Our solution consists of two components: an 
inexpensive hardware device called a Cache Miss Lookaside (CML) buffer that detects 
conflicts by recording and summarizing a history of cache misses, and a software policy within 
the operating system's virtual memory system that removes conflicts by dynamically 
remapping pages whenever large numbers of conflict miss ... 

14 A coherent distributed file cache with directory write-behind 

Timothy Mann , Andrew Birrell , Andy Hisgen , Charles Jerian , Garret Swart 

ACM Transactions on Computer Systems (TOCS) May 1994 

Volume 12 Issue 2 

Extensive caching is a key feature of the Echo distributed file system. Echo client machines 
maintain coherent caches of file and directory data and properties, with write-behind (delayed 
write-back) of all cached information. Echo specifies ordering constraints on this write-behind, 
enabling applications to store and maintain consistent data structures in the file system even 
when crashes or network faults prevent some writes from being completed. In this paper we 
describe ... 



15 High-speed buffering for variable length operands 
H. L. Tredennick , T. A. Welch 

ACM SIGARCH Computer Architecture News , Proceedings of the 4th annual 
symposium on Computer architecture March 1977 
Volume 5 Issue 7 

Variable word length processing is valuable for data base manipulations, editing functions in 
time-sharing systems, input-output data formatting, and vector operations, but current 
computer architectures seldom provide efficient means for manipulating variable length 
operands. A specialized computer architecture has been proposed to deal with the problems of 
variable length byte string processing. Operand buffering is a key part of the proposed 
architecture because the buffer: (1) replaces ... 



16 Architecture 2: An interleaved cache clustered VLIW processor 

Enric Gibert , Jesus Sanchez , Antonio Gonzalez 

Proceedings of the 16th international conference on Supercomputing June 2002 

Clustered microarchitectures are becoming a common organization due to their potential to 
reduce the penalties caused by wire delays and power consumption. Fully-distributed 
architectures are particularly effective to deal with these constraints, and besides they are very 
scalable. However, the distribution of the data cache memory poses a significant challenge and 
may be critical for performance. In this work, a distributed data cache VLIW architecture 
based on an interleaved cache organizati ... 



17 The development of the MU5 computer system 
R. N. Ibbett , P. C. Capon 

Communications of the ACM January 1978 

Volume 21 Issue 1 

Following a brief outline of the background of the MU5 project, the aims and ideas for MU5 
are discussed. A description is then given of the instruction set, which includes a number of 
features conducive to the production of efficient compiled code from high-level language 
source programs. The design of the processor is then traced from the initial ideas for an 
associatively addressed “name store” to the final multistage pipeline structure 
involving a prediction mechanism for in ... 
Glenn Reinman , Brad Calder , Todd Austin 

Proceedings of the 32nd annual ACM/IEEE international symposium on 
Microarchitecture November 1999 

Instruction supply is a crucial component of processor performance. Instruction prefetching 
has been proposed as a mechanism to help reduce instruction cache misses, which in turn can 
help increase instruction supply to the processor. In this paper we examine a new instruction 
prefetch architecture called Fetch Directed Prefetching, and compare it to the performance of 
next-line prefetching and streaming buffers. This architecture uses a decoupled b ... 

19 Microarchitecture support for improving the performance of load target prediction 99% 
Chung-Ho Chen , Akida Wu 

Proceedings of the 30th annual ACM/IEEE international symposium on 
Microarchitecture December 1997 

Presents a load target prediction scheme that mitigates the impact of load latency for modern 
microprocessors. The scheme uses a cache-like buffer to provide the base address, offset and 
operand size at the instruction fetching stage of a pipeline so that a load target address can be 
computed earlier at the decode stage. With the dynamic use of a load stride, the scheme has 
achieved a prediction rate that is 15% higher than a previously proposed approach. By 
providing a 128-entry direct-mapped 1 ... 

20 Architectural primitives for a scalable shared memory multiprocessor 99% 
Joonwon Lee , Umakishore Ramachandran 

Proceedings of the third annual ACM symposium on Parallel algorithms and 
architectures June 1991 
1 Multiprocessor Architectures For Concurrent Programs 100% 
Per Brinch Hansen 

Proceedings of the 1978 annual conference December 1978 

This paper proposes a hierarchical multiprocessor architecture for real-time programs written 
in a concurrent programming language. The use of processes and monitors leads to a 
multiprocessor system in which each processor has a local store dedicated to a single process. 
The processors share a common store that contains the monitors. To avoid congestion in the 
common store the processes and monitors are partitioned into subsystems that share a 
hierarchy of common stores. The main goal is to ... 



2 Multiprocessor architectures for concurrent programs 
Per Brinch Hansen 

ACM SIGARCH Computer Architecture News December 1978 

Volume 7 Issue 4 

This paper proposes a hierarchical multiprocessor architecture for real-time programs written 
in a concurrent programming language. The use of processes and monitors leads to a 
multiprocessor system in which each processor has a local store dedicated to a single process. 
The processors share a common store that contains the monitors. To avoid congestion in the 
common store the processes and monitors are partitioned into subsystems that share a 
hierarchy of common stores. The main goal is to deve ... 



3 Runtime identification of cache conflict misses: The adaptive miss buffer 100% 
Jamison D. Collins , Dean M. Tullsen 

ACM Transactions on Computer Systems (TOCS) November 2001 

Volume 19 Issue 4 

This paper describes the miss classification table, a simple mechanism that enables the 
processor or memory controller to identify each cache miss as either a conflict miss or a 
capacity (non-conflict) miss. The miss classification table works by storing part of the tag of 
the most recently evicted line of a cache set. If the next miss to that cache set has a matching 
tag, it is identified as a conflict miss. This technique correctly identifies 88% of 
misses. Several applications of this i ... 
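
The mechanism described in this abstract can be sketched in a few lines. The following is a minimal illustration, not the authors' implementation: it assumes a direct-mapped cache, and the class names, the 8-bit partial tag width, and the 64-byte block size are all hypothetical choices made for the example.

```python
class MissClassificationTable:
    """One entry per cache set: partial tag of the most recently evicted line."""

    def __init__(self, num_sets, tag_bits=8):  # tag_bits is an assumed width
        self.mask = (1 << tag_bits) - 1
        self.last_evicted = [None] * num_sets

    def classify(self, set_index, tag):
        # A miss whose partial tag matches the last eviction from this set
        # is labeled a conflict miss; everything else is non-conflict.
        if self.last_evicted[set_index] == (tag & self.mask):
            return "conflict"
        return "non-conflict"

    def record_eviction(self, set_index, tag):
        self.last_evicted[set_index] = tag & self.mask


class DirectMappedCache:
    """Toy direct-mapped cache that consults the table on every miss."""

    def __init__(self, num_sets, block_bytes=64):
        self.num_sets = num_sets
        self.block_bytes = block_bytes
        self.tags = [None] * num_sets
        self.mct = MissClassificationTable(num_sets)

    def access(self, address):
        block = address // self.block_bytes
        set_index = block % self.num_sets
        tag = block // self.num_sets
        if self.tags[set_index] == tag:
            return "hit"
        kind = self.mct.classify(set_index, tag)
        if self.tags[set_index] is not None:
            # Remember which line is being displaced from this set.
            self.mct.record_eviction(set_index, self.tags[set_index])
        self.tags[set_index] = tag
        return kind
```

With four sets, two addresses that map to the same set and are accessed alternately show the intended behavior: the first access to each is a non-conflict miss, and the return to the first address is flagged as a conflict miss because its tag matches the line just evicted from that set. Storing only part of the tag means unrelated lines can occasionally match, which is one reason such a scheme classifies most but not all misses correctly.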



4 Third Generation Computer Systems 
Peter J. Denning 

ACM Computing Surveys (CSUR) December 1971 

Volume 3 Issue 4 

The common features of third generation operating systems are surveyed from a general view, 
with emphasis on the common abstractions that constitute at least the basis for a 
“theory” of operating systems. Properties of specific systems are not discussed 
except where examples are useful. The technical aspects of issues and concepts are stressed, 
the nontechnical aspects mentioned only briefly. A perfunctory knowledge of third generation 
systems is presumed. 



5 Delay streams for graphics hardware 

Timo Aila , Ville Miettinen , Petri Nordlund 

ACM Transactions on Graphics (TOG) July 2003 

Volume 22 Issue 3 

In causal processes decisions do not depend on future data. Many well-known problems, such 
as occlusion culling, order-independent transparency and edge antialiasing cannot be properly 
solved using the traditional causal rendering architectures, because future data may change the 
interpretation of current events. We propose adding a delay stream between the vertex and 
pixel processing units. While a triangle resides in the delay stream, subsequent triangles 
generate occlusion information. ... 



6 Protocol verification using reachability analysis: the state space explosion problem and relief 99% 
strategies 

F. J. Lin , P. M. Chu , M. T. Liu 

ACM SIGCOMM Computer Communication Review , Proceedings of the ACM 
workshop on Frontiers in computer communications technology August 1987 
Volume 17 Issue 5 

Reachability analysis has proved to be one of the most effective methods in verifying 
correctness of communication protocols based on the state transition model. Consequently, 
many protocol verification tools have been built based on the method of reachability analysis. 
Nevertheless, it is also well known that state space explosion is the most severe limitation to 
the applicability of this method. Although researchers in the field have proposed various 
strategies to relieve this intricate p ... 



7 Detailed design and evaluation of redundant multithreading alternatives 
Shubhendu S. Mukherjee , Michael Kontz , Steven K. Reinhardt 

ACM SIGARCH Computer Architecture News May 2002 

Volume 30 Issue 2 

Exponential growth in the number of on-chip transistors, coupled with reductions in voltage 
levels, makes each generation of microprocessors increasingly vulnerable to transient faults. In 
a multithreaded environment, we can detect these faults by running two copies of the same 
program as separate threads, feeding them identical inputs, and comparing their outputs, a 
technique we call Redundant Multithreading (RMT). This paper studies RMT techniques in the 
context of both single- and dual-process ... 

8 Predictor-directed stream buffers 99% 
Timothy Sherwood , Suleyman Sair , Brad Calder 

Proceedings of the 33rd annual ACM/IEEE international symposium on 
Microarchitecture December 2000 

9 Performance of database workloads on shared-memory systems with out-of-order processors 99% 
Parthasarathy Ranganathan , Kourosh Gharachorloo , Sarita V. Adve , Luiz Andre Barroso 

Proceedings of the eighth international conference on Architectural support for 
programming languages and operating systems October 1998 
Volume 33 , 32 Issue 11,5 

Database applications such as online transaction processing (OLTP) and decision support 
systems (DSS) constitute the largest and fastest-growing segment of the market for 
multiprocessor servers. However, most current system designs have been optimized to perform 
well on scientific and engineering workloads. Given the radically different behavior of 
database workloads (especially OLTP), it is important to re-evaluate key system design 
decisions in the context of this important class of applicatio ... 

10 Trace-driven memory simulation: a survey 99% 
Richard A. Uhlig , Trevor N. Mudge 

ACM Computing Surveys (CSUR) June 1997 
Volume 29 Issue 2 

As the gap between processor and memory speeds continues to widen, methods for evaluating 
memory system designs before they are implemented in hardware are becoming increasingly 
important. One such method, trace-driven memory simulation, has been the subject of intense 
interest among researchers and has, as a result, enjoyed rapid development and substantial 
improvements during the past decade. This article surveys and analyzes these developments by 
establishing criteria for evaluating trac ... 

11 Classification and performance evaluation of instruction buffering techniques 99% 
Lizyamma Kurian , Paul T. Hulina , Lee D. Coraor , Dhamir N. Mannai 

ACM SIGARCH Computer Architecture News , Proceedings of the 18th annual 
international symposium on Computer architecture April 1991 
Volume 19 Issue 3 

12 The NYU Ultracomputer—designing a MIMD, shared-memory parallel machine 99% 
(Extended Abstract) 

Allan Gottlieb , Ralph Grishman , Clyde P. Kruskal , Kevin P. McAuliffe , Larry Rudolph , 
Marc Snir 

Proceedings of the 9th annual symposium on Computer Architecture April 1982 
We present the design for the NYU Ultracomputer, a shared-memory MIMD parallel machine 
composed of thousands of autonomous processing elements. This machine uses an enhanced 
message switching network with the geometry of an Omega-network to approximate the ideal 
behavior of Schwartz's paracomputer model of computation and to implement efficiently the 
important fetch-and-add synchronization primitive. We outline the hardware that would be 
required to build a 4096 processor system using 1990* ... 
13 The holodeck ray cache: an interactive rendering system for global illumination in nondiffuse 99% 
environments 

Gregory Ward , Maryann Simmons 

ACM Transactions on Graphics (TOG) October 1999 

Volume 18 Issue 4 

We present a new method for rendering complex environments using interactive, progressive, 
view-independent, parallel ray tracing. A four-dimensional holodeck data structure serves as a 
rendering target and caching mechanism for interactive walk-throughs of nondiffuse 
environments with full global illumination. Ray sample density varies locally according to 
need, and on-demand ray computation is supported in a parallel implementation. The holodeck 
file is stored on disk and ... 



14 Multiple instruction issue in the NonStop cyclone processor 99% 
Robert W. Horst , Richard L. Harris , Robert L. Jardine 

ACM SIGARCH Computer Architecture News , Proceedings of the 17th annual 
international symposium on Computer Architecture May 1990 
Volume 18 Issue 3 

This paper describes the architecture for issuing multiple instructions per clock in the NonStop 
Cyclone Processor. Pairs of instructions are fetched and decoded by a dual two-stage prefetch 
pipeline and passed to a dual six-stage pipeline for execution. Dynamic branch prediction is 
used to reduce branch penalties. A unique microcode routine for each pair is stored in the large 
duplexed control store. The microcode controls parallel data paths optimized for executing the 
most frequent instr ... 

15 The NYU ultracomputer—designing a MIMD, shared-memory parallel machine 99% 
Allan Gottlieb , Ralph Grishman , Clyde P. Kruskal , Kevin P. McAuliffe , Larry Rudolph , 
Marc Snir 

25 years of the international symposia on Computer architecture (selected papers) 
August 1998 

16 ECOSystem: managing energy as a first class operating system resource 98% 
Heng Zeng , Carla S. Ellis , Alvin R. Lebeck , Amin Vahdat 

Tenth international conference on architectural support for programming languages 
and operating systems on Proceedings of the 10th international conference on 
architectural support for programming languages and operating systems (ASPLOS-X) 
October 2002 

Volume 37 , 30 , 36 Issue 10,5,5 

Energy consumption has recently been widely recognized as a major challenge of computer 
systems design. This paper explores how to support energy as a first-class operating system 
resource. Energy, because of its global system nature, presents challenges beyond those of 
conventional resource management. To meet these challenges we propose the Currentcy 
Model that unifies energy accounting over diverse hardware components and enables fair 
allocation of available energy among applications. Our par ... 



17 The design and implementation of a progressive on-demand image dissemination system for 98% 
very large images 

Michael J. Owen , Andrew K. Lui , Edward H. S. Lo , Mark W. Grigg 

Australian Computer Science Communications , Proceedings of the 24th Australasian 
conference on Computer science January 2001 
Volume 23 Issue 1 

The use of progressive, on-demand image dissemination techniques can support efficient 
dissemination of very large images across networks. In this paper we examine the 
effectiveness of various design options in developing such on-demand dissemination systems. 
We show that the choice of the design options can have a profound impact on the efficient use 
of client, server, and network resources. Based on our performance evaluation experiments, we 
recommend that efficient dissemination can be achiev ... 